Hey, I’m Jabril and welcome to Crash Course AI.
Say I want to get a cookie from a jar that’s on a tall shelf.
There isn’t one “right way” to get the cookies.
Maybe I find a ladder, use a lasso, or build a complicated system of pulleys.
These could all be brilliant or terrible ideas, but if something works, I get the sweet taste
of victory... and I learn that doing that same thing could get me another cookie in
the future.
We learn lots of things by trial-and-error, and this kind of “learning by doing” to
achieve complicated goals is called Reinforcement Learning.
INTRO
So far, we’ve talked about two types of learning in Crash Course AI: Supervised Learning,
where a teacher gives an AI answers to learn from, and Unsupervised Learning, where an
AI tries to find patterns in the world.
Reinforcement Learning is particularly useful for situations where we want to train AIs
to have certain skills we don’t fully understand ourselves.
For example, I’m pretty good at walking, but trying to explain the process of walking
is kind of difficult.
What angle should your femur be relative to your foot?
And should you move it with an average angular velocity of… yeah, never mind… its really
difficult.
With reinforcement learning, we can train AIs to perform complicated tasks.
But unlike other techniques, we only have to tell them at the very end of the task if
they succeeded, and then ask them to tell us how they did it.
(We’re going to focus on this general case, but sometimes this feedback could come earlier.
So if we want an AI to learn to walk, we give them a reward if they’re both standing up
and moving forward, and then figure out what steps they took to get to that point.
The longer the AI stands up and moves forward, the longer it’s walking, and the more reward
it gets.
So you can kind of see how the key to reinforcement learning is just trial-and-error, again and
again.
For humans, a reward might be a cookie or the joy of winning a board game.
But for an AI system, a reward is just a small positive signal that basically tells it “good
job” and “do that again”!
Google Deepmind got some pretty impressive results when they used reinforcement learning
to teach virtual AI systems to walk, jump, and even duck under obstacles.
It looks kinda silly, but works pretty well!
Other researchers have even helped real life robots learn to walk.
So seeing the end result is pretty fun and can help us understand the goals of reinforcement
learning.
But to really understand how reinforcement learning works, we have to learn new language
to talk about these AI and what they’re doing.
Similar to previous episodes, we have an AI (or Agent) as our loyal subject that’s going
to learn.
An agent makes predictions or performs Actions, like moving a tiny bit forward, or picking
the next best move in a game.
And it performs actions based on its current inputs, which we call the State.
In supervised learning, after /each/ action, we would have a training label that tells
our AI whether it did the right thing or not.
We can’t do that here with reinforcement learning, because we don’t know what the
“right thing” actually is until it’s completely done with the task.
This difference actually highlights one of the hardest parts of reinforcement learning
called credit assignment.
It’s hard to know which actions helped us get to the reward (and should get credit)
and which actions slowed down our AI when we don’t pause to think after every action.
So the agent ends up interacting with its Environment for a while, whether that’s
a game board, a virtual maze, or real life kitchen.
And the agent takes many actions until it gets a Reward, which we give out when it wins
a game or gets that cookie jar from that really tall shelf.
Then, every time the agent wins (or succeeds at its task), we can look back on the actions
it took and slowly figure out which game states were helpful and which weren’t.
During this reflection, we’re assigning Value to those different game states and deciding
on a Policy for which actions work best.
We need Values and Policies to get anything done in reinforcement learning.
Let’s say I see some food in the kitchen: a box, a small bag, and a plate with a donut.
So my brain can assign each of these a value, a numerical yummy-ness value.
The box probably has 6 donuts in it, the bag probably has 2, and the plate just has 1…
so the values I assign are 6, 2, and 1.
Now that I’ve assigned each of them a value, I can decide on a policy to plan what action
to take!
The simplest policy is to go to the highest value (that box of possibly 6 donuts).
But I can’t see inside of it, and that could be a box of bagels, so it’s high reward
but high risk.
Another policy could be low reward but low risk, going with the plate with 1 guaranteed
delicious donut.
Personally, I’d pick a middle-ground policy, and go for the bag because I have a better
chance of guessing that there are donuts inside than the box, and a value of 1 donut isn’t
enough.
That’s a lot of vocab, so let’s see these concepts in action to help us remember everything.
Our example is going to focus on a mathematical framework that could be used with different
underlying machine learning techniques.
Let’s say John-Green-bot wants to go to the charging station to recharge his batteries.
In this example, John-Green-bot is a brand new Agent, and the room is the Environment
he needs to learn about.
From where he is now in the room, he has four possible Actions: moving up, down, left, or
right.
And his State is a couple of different inputs: where he is, where he came from, and what
he sees.
For this example, we’ll assume John-Green-bot can see the whole room.
So when he moves up (or any direction), his state changes.
But he doesn’t know yet if moving up was a good idea, because he hasn’t reached a
goal.
So go on, John-Green-bot... explore!
He found the battery, so he got a Reward (that little plus one).
Now, we can look back at the path he took and give all the cells he walked through a
Value -- specifically, a higher value for those near the goal, and lower for those farther
away.
These higher and lower values help with the trial-and-error of reinforcement learning,
and they give our agent more information about better actions to take when he tries again!
So if we put John-Green-bot back at the start, he’ll want to decide on a Policy that maximizes
reward.
Since he already knows a path to the battery, he’ll walk along that path, and he’s guaranteed
another +1.
But that’s… too easy.
And kind of boring if John-Green-bot just takes the same long and winding path every
time.
So another important concept in reinforcement learning is the trade-off between exploitation
and exploration.
Now that John-Green-bot knows one way to get to the battery, he could just exploit this
knowledge by always taking the same 10 actions.
It’s not a terrible idea -- he knows he won’t get lost and he’ll definitely get
a reward.
But this 10-action path is also pretty inefficient, and there are probably more efficient paths
out there.
So exploitation may not be the best strategy.
It’s usually worth trying lots of different actions to see what happens, which is a strategy
called exploration.
Every new path John-Green-bot takes will give him a bit more data about the best way to
get a reward.
So let’s let John-Green-bot explore for 100 actions, and after he completes a path,
we’ll update the values of the cells he’s been to.
Now we can look at all these new values!
During exploration, John-Green-bot found a short-cut, so now he knows a path that only
takes 4 actions to get to the goal.
This means our new policy (which always chooses the best value for the next action) will take
John-Green-bot down this faster path to the target.
That’s much better than before, but we paid a cost, because during those 100 actions of
exploration, he took some paths that were even /more/ inefficient than the first 10-action
try and only got a total of 6 points.
If John-Green-bot had just exploited his knowledge of the first path he took for those 100 actions,
he could have made it to the battery 10 times and gotten 10 points.
So you could say that exploration was a waste of time.
BUT if we started a new competition between the new John-Green-bot (who knows a 4-action
path) and his younger, more foolish self (who knows a 10-action path), over 100 actions,
the new John-Green-bot would be able to get 25 points because his path is much faster.
His reinforcement learning helped!
So should we explore more to try and find an even better path?
Or should we just use exploitation right away to collect more points?
In many reinforcement learning problems, we need a balance of exploitation and exploration,
and people are actively researching this tradeoff.
These kinds of problems can get even more complicated if we add different kinds of rewards,
like a +1 battery and a +3 bigger battery.
Or there could even be Negative Rewards that John-Green-Bot needs to learn to avoid, like
this black hole.
If we let John-Green-Bot explore this new environment using reinforcement learning,
sometimes he falls into the black hole.
So the cells will end up having different values than the earlier environment, and there
could be a different best policy.
Plus, the whole environment could change in many of these problems.
If we have an AI in our car helping us drive home, the same road will have different people,
bicycles, cars, and black holes on it every day.
There might even be construction that completely reroutes us.
This is where reinforcement learning problems get more fun, but much harder.
When John-Green-bot was learning how to navigate on that small grid, cells closer to the battery
had higher values than those far away.
But for many problems, we’ll want to use a value function to think about what we’ve
done so far, and decide on the next move using math.
For example, in this situation where an AI is helping us drive home, if we’re optimizing
safety and we see the brake lights of the car in front of us, it’s probably time to
slow down, but if we saw a bag of donuts in the street, we would want to stop.
So reinforcement learning is a powerful tool that’s been around for decades, but a lot
of problems need a ton of data and a ton of time to solve.
There have been really impressive results recently thanks to deep reinforcement learning
on large-scale computing.
These systems can explore massive environments and a huge number of states, leading to results
like AIs learning to play games.
At the core of a lot of these problems are discrete symbols, like a command for forward
or the squares on a game board, so how to reason and plan in these spaces is a key part
of AI.
Next week, we’ll dive into symbolic AI and how it’s a powerful tool for systems we
use every day.
See you then.
Crash Course Ai is produced in association with PBS Digital Studios.
If you want to help keep Crash Course free for everyone, forever, you can join our community
on Patreon.
And if you want to learn other approaches to control robot behavior check out this video
on Crash Course Computer Science.